Testing for Benford’s Law in very small samples: Simulation study and a new test proposal

Benford’s Law defines a statistical distribution for the first and higher order digits in many datasets. Under very general condition, numbers are expected to naturally conform to the theorized digits pattern. On the other side, any deviation from the Benford distribution could identify an exogenous modification of the expected pattern, due to data manipulation or even fraud. Many statistical tests are available for assessing the Benford conformity of a sample. However, in some practical applications, the limited number of data to analyze may raise questions concerning their reliability. The first aim of this article is then to analyze and compare the behavior of Benford conformity testing procedures applied to very small samples through an extensive Monte Carlo experiment. Simulations will consider a thorough choice of compliance tests and a very heterogeneous selection of alternative distributions. Secondly, we will use the simulation results for defining a new testing procedure, based on the combination of three tests, that guarantees suitable levels of power in each alternative scenario. Finally, a practical application is provided, demonstrating how a sounding testing Benford compliance test for very small samples is important and profitable in anti-fraud investigations.


Introduction
Benford's Law (BL) defines a probability distribution for patterns of significant digits in numerical data. Its formulation is grounded on the intriguing observation made first by Newcomb [1], then by Benford [2], who noticed a non-uniform amounts of wear in the pages of the logarithmic tables. In its complete form, the law states that the leading digits of many natural phenomena are not uniformly distributed, as one may expect, but follow a logarithmic distribution: . . . ; D m ðXÞ ¼ d m � ¼ log 10 where D j (x) is the j-th significant digit of a positive real number x, d 1  Theoretical studies investigated the properties and provided a limit theorem for the digit distribution. Classical references for this topic are [3][4][5][6]. At the same time, empirical applications have shown that many sets of numerical data are consistent with BL, at least in its simplest form Eq (2). Some examples in this sense are stock indexes [7], hydrology data [8] and volcanology data [9]. In addition, thanks to its generality and feasibility in different fields, the Benford distribution has fruitfully supported fraud investigations and the detection of manipulated data, in particular concerning COVID declared figures [10], scientific studies [11], media and social networks data [12] and international trade [13]. The assumption is that clean data (i.e. without any external manipulation) are distributed according to Eq (1). Generally, this condition is satisfied whenever numbers are the result of mathematical operations (multiplication, division, raising to power and so on) on values taken from different random variables, as in the case of accounting data [14]. The value of a purchase, for example, is the outcome of the multiplication between the number of items and their unitary price, which itself comes from the combination of different numbers. In such context, a significant departure from the theoretical distribution can point to data sets that include fabricated numbers. The identification of non-Benford numbers can rely on a variety of statistical tests. Their empirical properties may significantly differ, but we expect that their power, ceteris paribus, increases with the sample dimension. On the contrary, when the number of observations is very small, "there may be insufficient power to meaningfully detect or confirm conformance with the law" ( [15], page 2793). Nevertheless, in many practical applications, the number of figures available for each individual sample to test could be quite limited. In this case, the usual solution for increasing the expected sensitivity is to run the BL compliance tests on several individual samples merged together. Consider, for example, the investigations aimed to assess the digit distribution of the numbers published in scientific journals. Their analyses are usually not performed article by article, but they rather consider groups of articles pulled together (for example, those published in the same year, as in [16]). This strategy is not feasible however when the aim is to identify the specific individual sample that may contain irregularities. This is just the case of anti-fraud, where the target is to identify the economic operators that may have manipulated their declared numbers. Thus, in such contexts, the only strategy is to maximize the reliability of the expected outcome, that is to choose the testing procedure that, for a given significance level, guarantees the largest expected power against a wide range of possible alternatives.
In this article, we firstly analyze and compare the behavior of several testing procedures against a huge set of alternative scenarios. Besides considering a thorough choice of Benford compliance tests and a very heterogeneous selection of alternative distributions, the main novelty with respect to previous works with similar aim (as, for example [17][18][19]), is that we focus on samples with very small dimensions (i.e. n = 20). We do not expect to find a testing procedure that strictly dominates the other in all the different scenarios. The target is rather to find the ones that show the best general behavior independently of the alternative. Secondly, we use the testing performances obtained in the simulation exercise to derive and propose a combined test that guarantees suitable levels of power in each alternative scenario. The objective of this study is then twofold. From one side, we provide a comprehensive analysis of small samples properties of Benford's compliance tests that could support the researchers in the selection of the proper procedure. From the other, we propose an alternative method based on the combination of different tests that offers desirable performance against a wide range of possible deviations. An empirical application on international trade data shows how the availability of reliable BL testing procedures for very small samples can remarkably increase the range of applicability, and provide a valuable support for limiting the number of economic transactions that deserve further anti-fraud investigations.
The paper proceeds as follows. In the next section, the set of BL tests compared in the simulation exercise are briefly introduced, properly divided into three families: tests for the FSD, tests for the complete form of the BL and the summation test. This is followed by a section that describes the different alternative distributions of values considered in the Monte Carlo experiment. Therefore, the simulation results are presented and discussed. In addition, we used them to define a combined test with desirable empirical properties. We present an empirical application of Benford compliance tests on international trade data, stressing the importance of reliable methods for small samples in such context. The final section concludes.

Testing the BL conformity
Many testing procedures allow to assess the conformity of a set of values to the Benford's theory. In the first part of our study, we want to study their behavior in small samples through a huge simulation exercise. We took into account most of the testing procedures proposed in past investigations, in order to provide a complete comparison. They can be divided into three families: tests for the FSD, tests for the complete BL, and the summation tests recently proposed by [20].

Tests for the FSD
Given an n-sample of values {x 1 , . . ., x n }, we can assess the BL conformity of the FSD through the null hypothesis: where p k = n k /n, and n k ¼ P n i¼1 IðD 1 ðx i Þ ¼ kÞ. The first test is the well known Pearson goodness-of-fit test: that is asymptotically distributed as a w 2 8 . It is surely one of most used statistics in BL compliance experiments, even though its potential low power, especially in small samples, suggests some cautions [21]. In this simulation exercise, it could be considered as a benchmark for assessing the gain in power associated with the other testing procedures.
A second category of FSD tests derives from the Cramer-Von-Mises, Watson and Anderson-Darling statistics, which are mainly used for testing the goodness-of-fit for continuous distributions [22,23]. Let F k ¼ P k j¼1 p k and F B k ¼ P k j¼1 p B k denote the cumulative distributions of respectively the empirical and the expected proportions, whose difference is defined as , and use them for calculating the weighted mean of the cumulative distributions distances: � Z ¼ P 9 k¼1 w k Z k . The three statistics are defined as: Asymptotic critical values are available for the three tests [19].
Then, we will consider two tests based on the Kolmogorov-Smirnov (KS) distance of the cumulative FSD distributions. In particular, the first is simply the KS deviation: whereas the second is the Kuiper test [24], defined as: Finally, the last FSD test considered is the Mean Test introduced by [16] and based on the simple observation that, if the FSD is distributed according to Eq (2), then its expected value is equal to 3.440 and its variance to 6.057. The Mean Test is defined as: ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi ffi 6:057 p : ð8Þ

Tests for the complete form of the BL
The second family of tests applied in the simulation experiment does not limit the attention to the FSD, but considers the whole digit distribution (1). Defining the significand of a value y 2 R as SðyÞ ¼ 10 log 10 jyjÀ blog 10 jyjc , with byc ¼ maxfm 2 Z : m � yg representing the floor function [5,25], proved that: the BL conformity could be tested through the statistics based on the KS distance applied on the logarithm of the ordered significands. As before, we considered both the KS test: and the Kuiper test: Recently [26], showed that KS s test for uniformity provides very good results for small, medium size and even large records.
In addition, we also consider the Anderson-Darling test [27], defined as: Summation test. Finally, the last test considered is the summation test Q [20]. Starting from the definition of the significand previously introduced, we define According to these definitions and representing C = log 10 e, the summation test limited to the first-digit case is given by: where � Z ¼ ð � Z 1 ; . . . ; � Z 9 Þ 0 , μ is the nine elements vector μ = (C, . . ., C) 0 , and S is the (9 × 9) matrix with elements σ kk = C(k + 1/2 − C) and σ kj = −C 2 whenever k 6 ¼ j.

Description of the alternative distributions
The finite sample behavior of the eleven testing procedures introduced in the previous section are investigated in a simulation study. Table 1 presents all the alternatives distributions considered in the simulation experiment, together with the corresponding parameter space. In each simulation s, a sample X s ¼ fx s 1 ; . . . x s n g is generated according to one of the pattern listed in the Table 1. Actually, the first three distributions are alternative patterns only for the FSD of x. In this case, the remaining digits of each number are simulated according to the Benford probabilities, in order to allow a suitable calculation of the tests for the complete form of the BL and of the summation test.
The first alternative family for the FSD is the Generalized Benford's Law (GB, [28,29]): where k 2 {1, 2, . . ., 9} and y 2 R. Therefore, when θ = 0, the GB reduces to the standard Benford distribution for the FSD (2), whereas for θ = 1, the resulting distribution is uniform. This last case is quite interesting, since it mimics a manipulation strategy widely used in past applications [13,19], based on a naive assignation of the first digit with equal probabilities. Finally, when θ < 0, more weight is assigned to smaller digits (1 and 2), whereas when θ > 1, more weight is assigned to bigger digits (from 4 to 9, see panel a of Fig 1). Then, we considered the FSD patterns proposed by Rodriguez (R, [30]): ln ð10Þ þ k ln ðkÞ À ðk þ 1Þ ln ðk þ 1Þ where k 2 {1, 2, . . ., 9} and b 2 R. Again, the standard Benford distribution (2) and the uniform distribution of the FSD are two particular cases of this family. The former corresponds to β = −1, whereas the latter to β = ±1. In general, when β < 0 (β > 0), the digit probabilities are decreasing according to a convex (concave) pattern (see panel b of Fig 1).

PLOS ONE
with k 2 {1, 2, . . ., 9} and ρ > 0. When ρ = 1 or ρ = 2, the distribution of the FSD corresponds to (2), whereas for other values of the parameter, the resulting distribution has a convex Ushape (see panel c of Fig 1). The last two alternative distributions are the Lognormal (LN) and the Weibull (WB), two continuous random variables that may reasonably arise in many fields. Both of them can become very close to a Benford random variable for some combination of their parameters. Actually [4,32], showed that a LN random variable with a large shape parameter is practically indistinguishable from a Benford random variable. This is why, in the simulation exercise, we decided to fix the scale parameter μ to five different values, and to let the shape parameter σ vary from 0.1 to 1. Similar arguments apply also to the WB random samples. In this case, the simulated values closely fit the Benford distribution for small values of the shape parameter [6]. Again, we fixed scale parameter a to five different values, and to let the shape parameter b vary from 0.5 to 5. Panels d and e of Fig 1 provide a representation of the FSD probabilities associated to the LN and WB distributions when the shape parameter is respectively 0.5 and 3.
Therefore, the wide range of possible alternative distributions selected for the simulation exercise produces a heterogeneous set of patterns for the FSD probabilities. This allows an exhaustive assessment of the performance of the tests described in the previous section under very general contexts and scenarios.

Simulation results
Since the focus of our study is on small samples, for each alternative scenario we consider 4 sample dimensions, i.e. n = 20, 30, 40 and 50. In this section, we present only the results obtained with n = 20. The simulation outcomes for the other sample dimensions (available upon request) are numerically different, but confirm the patterns described for the smallest dimension. Even though the asymptotic critical values for some of the implemented tests are available, the small dimensions of the simulated samples recommends to use exact critical values. We then computed the 1% critical value for each test and each different sample size through one million simulations of Benford distributed values. For each alternative scenario listed in Table 1, we generate 100,000 simulations and calculate the rejection rates of each test.
The three alternative FSD patterns considered reduces to Eq (2) for particular values of its parameter. In these cases, rejection rates measure the empirical size of the tests. Table 2 presents the percentage rejection rates obtained when the FSD patterns corresponds exactly to Eq (2). The values range from 0.901% to 1.064% and confirm a suitable accuracy of the empirical size, not significantly different from the nominal value of 1%. Table 3 shows the rejection rates of the tests against the three alternative FSD for a subset of the parameters considered in the Monte Carlo exercise (as before, the full set of results is available upon request). The panel reserved to GB alternatives highlights that, independently of the value of θ, A 2 , W 2 , AD, KS d and MT seem to offer the best general performances, whereas the power of KS s is optimal when θ < 0, but deteriorates when θ > 0. All the other tests seem to offer rejection rates always below average. When the digits are uniformly distributed (i.e. θ = 1), the power of MT, W 2 and U 2 is slightly less than 60%, and the one of AD and KS d is close to 50%. All the other tests considered exhibit a power between 30 and 35%.

Power against alternative FSD patterns
The central panel of Table 3 presents the rejection rates obtained with FSD simulated according to Eq (14). A 2 , W 2 , AD, KS d and MT show again the best general properties independently of the value of the parameter, while the power of the remaining tests results smaller. The only exception is the case β = 1, where the top performers are U 2 and KU d .
Finally, the output against Hürlimann distributed digits is in the bottom panel of Table 3. In this case, the top performers are U 2 , KU d and KU s , whereas the power of χ 2 and Q, is optimal only when ρ < 1. Instead, the rejection rates of MT, KS s and W 2 are always below average.  Table 4 presents the results obtained with data simulated according to a lognormal distribution with the five selected scale parameters. As expected, the rejection rates are inversely related to the shape parameter. When σ decreases the rejection rates are close to 1, while when the shape approaches to 1, the rejection rates approach to the nominal level of the test. Even though every panel seems to tell a slightly different story, there are some testing procedures that seem to regularly offer desirable properties. In particular, they are KU s , U 2 and KU d . Also KS s and

PLOS ONE
KS d yield adequate results on average. On the other side of the ranking, what stands out are the poor performances offered by W 2 when μ = 0; by MT when μ = 0, 1 and 1.5; by AD when μ = 1; and by χ 2 and Q when μ = 0.5. Table 5 shows instead the rejection rates obtained with Weibull distributed values for fixed scale coefficients. As expected, now the rejection rates are directly related to the shape parameter. Thus, when b increases, the power approaches to 1. Again, the outstanding performers are KU s , U 2 and KU d , followed by KS s and KS d that also offer outcomes above the average. Particularly negative results are showed by MT when a = 0.5 and 1.5; and by χ 2 and Q when a = 2.

General comments
The comparison of different BL compliance tests in a simulated environment provided several important conclusions. First of all, it is not necessarily true that in very small samples there is insufficient power to meaningfully detect or confirm conformance with the BL. Simulations showed that a careful choice of the testing procedure allows to significantly increase the expected power. Consider, for example, the case of the uniform alternative. As previously stressed, this is the FSD distribution that we should expect when numbers are falsified by a

PLOS ONE
Testing for Benford's Law in very small samples manipulator that believes that the proportion of ones, twos,. . ., nines in the first digit should be equal. By choosing the usual χ 2 test, as most of the analysts do in practice, we can expect a power of 33.5%. But if we consider A 2 or MT the expected power almost doubles. Also in other scenarios, simulation results confirmed that the power of χ 2 is usually dominated by the one offered by alternative testing procedures. Thus, despite its popularity, the use of this test should be avoided, at least in very small samples. The other testing procedures considered in the simulation exercise showed very different characteristics, and there is not one that strictly dominates the others. It is also interesting to note that the statistics aimed to test the simplest null (2) work properly also when the deviation affects all the digits, and are not necessary dominated by their counterparts that consider the complete joint digits distribution. This is, for example, the case of U 2 , which is optimal not only when the FSD follows a Hürlimann distribution, but also when the simulated values are Weibull distributed with scale parameter equal to 0.5. However, it performs poorly when the FSD is simulated according to a Generalized Benford distribution with θ > 0 and to a Rodriguez distribution with β < −1. This sub-optimal behavior is common also to other testing procedures. Therefore, it is not possible to identify a test that is able to offer suitable expected performances independently of the alternative scenario. One alternative option could be to define a combined test that merges together the positive features offered by the single tests, as described in the following section.

Combining tests for improving the general performance
The simulation results provided an exhaustive of the small sample behavior of different BL conformity tests. In particular, two of them offered the best general results across the different alternative scenarios: KU s and U 2 . The idea is then to combine the p-values of these tests in order to derive a procedure that offers a desirable level of power in all the alternative scenarios proposed. However, both offered poor performances when the FSD was distributed according to the Generalized Benford and the Rodriguez distribution. Therefore, in the definition of the combined test, we decided to consider also the A 2 test, whose performances under the Generalized Benford and the Rodriguez alternatives are optimal. Among the possible choices for combining the p-values, we chose the minimum function (see [33] for more details on this issue). Thus, the statistics is given by: where π(�) is the p-value function, defined as 1 − F(�). As usual, the p-values of the single tests and the 1% critical values of the combined test Γ will be calculated through the one million simulations of Benford distributed numbers. Our expectation is that the performances of Γ represent a favorable compromise of the ones of KU s , U 2 and A 2 . Figs 2-4 provide an immediate assessment of the behavior of Γ with respect to the three single tests. In addition, they allow a direct comparison with the maximum and the minimum rejection rates obtained with all the tests considered in the simulation exercise. The first thing to notice is that the combined test is always the second best choice, independently of which of the three single tests was the best choice. Secondly, the power of Γ is always closer to the best among KU s , U 2 and A 2 , than to the worst. These two features allow the combined test to offer convincing outcomes even when one or two single tests perform below the average, Consider, for example, the case represented in PowerCombinedGB for θ > 0: the power of KU s , U 2 is very close to the absolute minimum, whereas the rejection rate of Γ is very close to the one of A 2 , which is the absolute maximum. In conclusion, test Γ provides a convenient trade-off of the behaviors of the three single tests in all the scenarios considered.

Empirical application on international trade data
In this section, we provide some examples of the application of the BL compliance tests on Customs data. The contrast of fraud in this context is crucial. The protection of the financial interests is indeed a fundamental task of the European Union (EU) administration, and the  collection of customs duties on imports represents the principal source of traditional own resources of the EU budget [34]. In 2017, the EU's revenue from customs duties was more than 20 million, representing almost 15% of EU total revenue (source: https://ec.europa.eu/ budget/library/biblio/publications/2018/financial-report_en.pdf). Under-reporting the value of the imports is usually the main strategy pursued to pay less duties or excises, or to evade import restrictions and certain anti-dumping measures [35]. Proper statistical analyses provide support for the identification of anomalous trades that may result from unfair commercial strategies [36,37]. Several studies have shown that statistical methods based also on the Benford Law can be profitably used in this field, aiming to spot traders with more than 50 transactions that are suspected of data manipulation [13,20,38]. The possibility to extend the focus also to economic operators with less imports can significantly increase the feasibility of the anti-fraud analysis, especially when the investigations are limited to particular periods, or to particular groups of products. Just to have an idea, Table 6 shows the distribution of traders according to the number of declared imports during the whole 2014 in a single Member State of the European Union (not revealed for confidentiality issues). Considering that one year is the usual period length of Customs audit, it is evident how extending the focus also to less active importers allows to almost double the number of economic operators we can monitor. In addition, a reliable and accurate testing procedure for small samples allows to tighten the number of suspicious imports to investigate, reducing the costs of the demanding and time consuming controls necessary to prove the eventual fraud.
Consider, for instance, the two examples of the application of the BL compliance tests represented in Table 7. Data were collected in the context of a specific operation by a Member State of the EU and, for confidentiality reasons, the two traders will be labeled simply as Trader 1 and Trader 2. Trader 1 collected 58 imports during three years. The p-values of the tests a suggest a potential departure of the values declared from the theoretical distribution. Most of the p-values are indeed very close or even smaller than 1%, with only one of them larger than 5%. A year-by-year analysis reveals that the 21 import values declared in 2015 yield the smallest p-value, and should therefore be the first to investigate in the search for frauds. Actually, one of those 21 imports was checked by Custom authority, and the corresponding import resulted to be under-valuated. The analysis of Trader 2 provides another interesting pattern. Also in this case, the 94 imports were traded in a three years period, and the p-values of all the tests suggest BL compliance of the declared values. However, the year-by-year analysis identifies a significant deviation from the Benford distribution for the 37 values declared in 2014. Actually, 6 of them were checked by Custom authority, and the corresponding imports resulted to be under-valuated.
In conclusion, the availability of suitable and powerful procedures for testing BL compliance in small samples can efficiently support the Custom anti-fraud activities. It allows not only to extend the range of applicability through the possibility of use them for traders with few operations, but also to identify small subsets of imports that deserve further investigations.

Conclusion
The popularity of the BL is remarkably increasing in these last years. Since a set of numbers is expected to be Benford compliant under very general conditions, it can be used in many applications where the target is to identify potentially manipulated figures, as, for example, antifraud investigations. The automatic assessment of BL conformity through statistical methods requires testing procedures with desirable statistical properties. Especially in audit where many samples are tested, we aim for a statistical test which controls the number of false alarms and guarantees at the same time a suitable level of power. The choice of the Benford compliance test is even more important when the number of observations in the sample is small, given that suitable level of powers are very difficult to achieve.
The aim of this article was first of all to provide an extensive analysis of the performances of several BL compliance tests in very small samples. Simulation results proved that small sample properties can significantly vary, depending on the alternative scenario. In general, every test alternates good and average results across all the alternatives and there is no a procedure that strictly dominates the others. However, the following regularities emerged from the simulation exercise: • despite its popularity, the power of Pearson's χ 2 test was often below average; • KU s and U 2 achieved outstanding results in most of the alternative scenarios considered, but not when the FSD is uniformly distributed.
The need of a procedure with optimal results independently of the alternative digits pattern encouraged us to define a new test. The main idea was to merge the positive small sample properties offered by some of the tests in the simulation experiment. Therefore we proposed to combines the p-values of KU s , U 2 and A 2 through the min function. The resulting test achieved always desirable levels of powers in all alternative designs, even when one or two single tests performed below the average.
Finally, an empirical application presented a practical case where the availability of a reliable testing procedure for small samples allows (i) to increase the number of samples under investigation; and (ii) a more accurate selection of the cases that are suspected of data manipulation. This second issue is particularly relevant in anti-fraud audits. Non-conformity to the BL does not necessarily mean that the corresponding economic operator is a fraudster. Further, sometimes expensive, investigations are required to prove the irregularities. Therefore, having the possibility to focus the attention on a restricted number of suspect cases together with a reliable control over the number of false alarms are two essential requirements for an affordable anti-fraud analysis.